Without CD
Accuracy: 33.52%
Average IoU: 0.3019
Correct Predictions: 59/176
With CD
Accuracy: 35.80%
Average IoU: 0.3258
Correct Predictions: 63/176
Accuracy Difference: +2.27%
IoU Difference: +0.0239
Total Examples: 201
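As a sanity check on the summary figures, the accuracies follow directly from the correct-prediction counts. The sketch below is a minimal Python illustration, not the evaluation script itself. It assumes the first block of figures is the run without CD and the second the run with CD (the order given in the per-example headers), and that accuracy is correct predictions divided by the 176 scored examples. The name `accuracy_pct` is illustrative.

```python
def accuracy_pct(correct: int, scored: int) -> float:
    """Accuracy as a percentage of the scored examples."""
    return 100.0 * correct / scored


# Counts taken from the summary above; the run labels are an assumption.
without_cd = accuracy_pct(59, 176)
with_cd = accuracy_pct(63, 176)

# Prints: 33.52% 35.80% +2.27%
print(f"{without_cd:.2f}% {with_cd:.2f}% {with_cd - without_cd:+.2f}%")
```

The reported difference of +2.27% is therefore the second run minus the first, that is, four additional correct predictions out of 176.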
Dataset: refcocos_test
Caption: the computer screen that is in the middle layer
Image: val2017/000000547144.jpg
Ground Truth Without CD With CD
Generation Time: 5.25s
Predicted bbox: [162, 155, 306, 292]
Ground truth: [297.0, 345.0, 427.0, 440.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 10.37s
Predicted bbox: [160, 156, 306, 285]
Ground truth: [297.0, 345.0, 427.0, 440.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the person on the right hand side of the boy not wearing hat
Image: val2017/000000001000.jpg
Ground Truth Without CD With CD
Generation Time: 4.50s
Predicted bbox: [513, 188, 644, 476]
Ground truth: [386.0, 156.0, 461.0, 478.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 10.92s
Predicted bbox: [525, 189, 644, 476]
Ground truth: [386.0, 156.0, 461.0, 478.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: background person not leaning back
Image: val2017/000000006471.jpg
Ground Truth Without CD With CD
Generation Time: 4.20s
Predicted bbox: [173, 76, 244, 136]
Ground truth: [20.0, 98.0, 70.0, 149.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 9.66s
Predicted bbox: [173, 74, 256, 136]
Ground truth: [20.0, 98.0, 70.0, 149.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: non-glass cup
Image: val2017/000000002157.jpg
Ground Truth Without CD With CD
Generation Time: 4.07s
Predicted bbox: [0, 119, 55, 262]
Ground truth: [3.0, 121.0, 67.0, 267.0]
Result: Correct
IoU: 0.7423
Generation Time: 8.17s
Predicted bbox: [2, 93, 58, 272]
Ground truth: [3.0, 121.0, 67.0, 267.0]
Result: Correct
IoU: 0.7082
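The IoU values in these records can be reproduced from the two boxes. The sketch below is a minimal Python illustration that assumes boxes are [x1, y1, x2, y2] in pixels. It also assumes a prediction counts as correct when IoU is at least 0.5, a threshold inferred from the results and not stated in the log. The names `box_iou` and `IOU_THRESHOLD` are illustrative.

```python
IOU_THRESHOLD = 0.5  # assumed; inferred from the Correct/Incorrect labels


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    # Overlap width and height, clamped to zero for disjoint boxes.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Boxes from the "non-glass cup" record above.
gt = [3.0, 121.0, 67.0, 267.0]
for pred in ([0, 119, 55, 262], [2, 93, 58, 272]):
    iou = box_iou(pred, gt)
    print(round(iou, 4), "Correct" if iou >= IOU_THRESHOLD else "Incorrect")
```

For this record the sketch gives 0.7423 and 0.7082, matching the logged values, which suggests the log uses plain continuous-coordinate areas without a +1 pixel adjustment.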
Dataset: refcocos_test
Caption: person not holding anything
Image: val2017/000000009590.jpg
Ground Truth Without CD With CD
Generation Time: 4.60s
Predicted bbox: [368, 167, 456, 258]
Ground truth: [255.0, 179.0, 330.0, 254.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 10.49s
Predicted bbox: [368, 168, 458, 260]
Ground truth: [255.0, 179.0, 330.0, 254.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the woman looking at an apple laptop
Image: val2017/000000009400.jpg
Ground Truth Without CD With CD
Generation Time: 6.93s
Predicted bbox: [457, 89, 644, 394]
Ground truth: [1.0, 93.0, 114.0, 213.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 16.12s
Predicted bbox: [445, 89, 644, 403]
Ground truth: [1.0, 93.0, 114.0, 213.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the person helding nothing
Image: val2017/000000010707.jpg
Ground Truth Without CD With CD
Generation Time: 4.24s
Predicted bbox: [420, 0, 603, 476]
Ground truth: [347.0, 190.0, 478.0, 477.0]
Result: Incorrect
IoU: 0.1534
Generation Time: 8.77s
Predicted bbox: [323, 188, 480, 476]
Ground truth: [347.0, 190.0, 478.0, 477.0]
Result: Correct
IoU: 0.8262
Dataset: refcocos_test
Caption: suitcase next to car wheel
Image: val2017/000000009891.jpg
Ground Truth Without CD With CD
Generation Time: 3.99s
Predicted bbox: [424, 240, 498, 376]
Ground truth: [419.0, 245.0, 495.0, 350.0]
Result: Correct
IoU: 0.7040
Generation Time: 8.11s
Predicted bbox: [422, 240, 497, 377]
Ground truth: [419.0, 245.0, 495.0, 350.0]
Result: Correct
IoU: 0.7238
Dataset: refcocos_test
Caption: the person who is on the phone
Image: val2017/000000012670.jpg
Ground Truth Without CD With CD
Generation Time: 4.81s
Predicted bbox: [104, 122, 201, 271]
Ground truth: [100.0, 122.0, 199.0, 275.0]
Result: Correct
IoU: 0.9165
Generation Time: 10.82s
Predicted bbox: [105, 119, 199, 281]
Ground truth: [100.0, 122.0, 199.0, 275.0]
Result: Correct
IoU: 0.8993
Dataset: refcocos_test
Caption: the person who is wearing a necklace
Image: val2017/000000015335.jpg
Ground Truth Without CD With CD
Generation Time: 4.25s
Predicted bbox: [1, 74, 236, 476]
Ground truth: [3.0, 72.0, 219.0, 366.0]
Result: Correct
IoU: 0.6646
Generation Time: 8.15s
Predicted bbox: [1, 67, 221, 354]
Ground truth: [3.0, 72.0, 219.0, 366.0]
Result: Correct
IoU: 0.9267
Dataset: refcocos_test
Caption: image containing fork and knife
Image: val2017/000000008629.jpg
Ground Truth Without CD With CD
Generation Time: 4.54s
Predicted bbox: [430, 223, 626, 422]
Ground truth: [430.0, 226.0, 621.0, 417.0]
Result: Correct
IoU: 0.9353
Generation Time: 12.06s
Predicted bbox: [433, 224, 626, 420]
Ground truth: [430.0, 226.0, 621.0, 417.0]
Result: Correct
IoU: 0.9351
Dataset: refcocos_test
Caption: person outside the middle window
Image: val2017/000000000139.jpg
Ground Truth Without CD With CD
Generation Time: 4.12s
Predicted bbox: [387, 168, 402, 206]
Ground truth: None
Result: Unable to determine
Generation Time: 8.56s
Predicted bbox: [387, 169, 404, 205]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: person on bed
Image: val2017/000000000632.jpg
Ground Truth Without CD With CD
Generation Time: 4.35s
Predicted bbox: [0, 279, 404, 470]
Ground truth: None
Result: Unable to determine
Generation Time: 8.41s
Predicted bbox: [0, 277, 402, 476]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: letter X
Image: val2017/000000000885.jpg
Ground Truth Without CD With CD
Generation Time: 4.56s
Predicted bbox: [358, 70, 453, 120]
Ground truth: None
Result: Unable to determine
Generation Time: 8.99s
Predicted bbox: [447, 209, 520, 238]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: person holding a goose feather
Image: val2017/000000001268.jpg
Ground Truth Without CD With CD
Generation Time: 4.61s
Predicted bbox: [508, 79, 644, 415]
Ground truth: [20.0, 213.0, 81.0, 285.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 9.56s
Predicted bbox: [507, 80, 644, 415]
Ground truth: [20.0, 213.0, 81.0, 285.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: sign to 101 south
Image: val2017/000000001532.jpg
Ground Truth Without CD With CD
Generation Time: 6.21s
Predicted bbox: [71, 210, 291, 290]
Ground truth: None
Result: Unable to determine
Generation Time: 11.81s
Predicted bbox: [72, 214, 287, 290]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: child in second row from camera, third from left
Image: val2017/000000002299.jpg
Ground Truth Without CD With CD
Generation Time: 10.81s
Predicted bbox: None
Ground truth: [77.0, 135.0, 117.0, 234.0]
Result: Unable to determine
Generation Time: 11.84s
Predicted bbox: [180, 114, 230, 242]
Ground truth: [77.0, 135.0, 117.0, 234.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the ski pole held by the left hand of the person in the air
Image: val2017/000000002473.jpg
Ground Truth Without CD With CD
Generation Time: 6.49s
Predicted bbox: [353, 116, 396, 190]
Ground truth: [220.0, 117.0, 257.0, 167.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 12.52s
Predicted bbox: [360, 116, 396, 191]
Ground truth: [220.0, 117.0, 257.0, 167.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the shoes worn by the person with black hoodie
Image: val2017/000000002685.jpg
Ground Truth Without CD With CD
Generation Time: 5.02s
Predicted bbox: [399, 315, 454, 347]
Ground truth: [524.0, 370.0, 634.0, 438.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 9.80s
Predicted bbox: [397, 313, 452, 348]
Ground truth: [524.0, 370.0, 634.0, 438.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: baby corn
Image: val2017/000000003845.jpg
Ground Truth Without CD With CD
Generation Time: 4.13s
Predicted bbox: [79, 163, 150, 223]
Ground truth: [96.0, 163.0, 141.0, 240.0]
Result: Correct
IoU: 0.5373
Generation Time: 8.35s
Predicted bbox: [74, 146, 147, 215]
Ground truth: [96.0, 163.0, 141.0, 240.0]
Result: Incorrect
IoU: 0.3797
Dataset: refcocos_test
Caption: the man farthest from the camera
Image: val2017/000000003934.jpg
Ground Truth Without CD With CD
Generation Time: 4.76s
Predicted bbox: [94, 91, 152, 226]
Ground truth: [297.0, 133.0, 326.0, 221.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 10.20s
Predicted bbox: [155, 85, 200, 230]
Ground truth: [297.0, 133.0, 326.0, 221.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the person behind the lady in orange
Image: val2017/000000005001.jpg
Ground Truth Without CD With CD
Generation Time: 4.89s
Predicted bbox: [433, 26, 509, 155]
Ground truth: [425.0, 25.0, 506.0, 164.0]
Result: Correct
IoU: 0.8086
Generation Time: 8.39s
Predicted bbox: [428, 28, 512, 156]
Ground truth: [425.0, 25.0, 506.0, 164.0]
Result: Correct
IoU: 0.8301
Dataset: refcocos_test
Caption: the person who is not facing the camera and not holding it
Image: val2017/000000005193.jpg
Ground Truth Without CD With CD
Generation Time: 4.42s
Predicted bbox: [0, 86, 224, 415]
Ground truth: [224.0, 67.0, 265.0, 185.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 11.21s
Predicted bbox: [2, 89, 219, 415]
Ground truth: [224.0, 67.0, 265.0, 185.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the object held by the person on the right hand side of the person in red
Image: val2017/000000013291.jpg
Ground Truth Without CD With CD
Generation Time: 5.92s
Predicted bbox: [263, 170, 298, 205]
Ground truth: [182.0, 199.0, 217.0, 232.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 12.05s
Predicted bbox: [240, 170, 292, 205]
Ground truth: [182.0, 199.0, 217.0, 232.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the person sitting on the left side of the red chair
Image: val2017/000000014439.jpg
Ground Truth Without CD With CD
Generation Time: 4.18s
Predicted bbox: [59, 128, 120, 169]
Ground truth: [23.0, 120.0, 64.0, 155.0]
Result: Incorrect
IoU: 0.0355
Generation Time: 9.61s
Predicted bbox: [63, 127, 114, 168]
Ground truth: [23.0, 120.0, 64.0, 155.0]
Result: Incorrect
IoU: 0.0080
Dataset: refcocos_test
Caption: the second worker from the right
Image: val2017/000000014473.jpg
Ground Truth Without CD With CD
Generation Time: 5.62s
Predicted bbox: [295, 267, 314, 307]
Ground truth: [273.0, 272.0, 300.0, 310.0]
Result: Incorrect
IoU: 0.1086
Generation Time: 9.51s
Predicted bbox: [317, 267, 339, 310]
Ground truth: [273.0, 272.0, 300.0, 310.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the frisbee that the child in blue looking at
Image: val2017/000000006954.jpg
Ground Truth Without CD With CD
Generation Time: 4.20s
Predicted bbox: [466, 235, 611, 364]
Ground truth: [248.0, 228.0, 366.0, 345.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 9.46s
Predicted bbox: [465, 232, 603, 361]
Ground truth: [248.0, 228.0, 366.0, 345.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the glass behind the flower
Image: val2017/000000007818.jpg
Ground Truth Without CD With CD
Generation Time: 3.70s
Predicted bbox: [347, 158, 419, 299]
Ground truth: [402.0, 187.0, 445.0, 292.0]
Result: Incorrect
IoU: 0.1386
Generation Time: 8.02s
Predicted bbox: [346, 158, 417, 293]
Ground truth: [402.0, 187.0, 445.0, 292.0]
Result: Incorrect
IoU: 0.1257
Dataset: refcocos_test
Caption: person other than the man and his reflection
Image: val2017/000000009483.jpg
Ground Truth Without CD With CD
Generation Time: 4.36s
Predicted bbox: [301, 73, 380, 260]
Ground truth: None
Result: Unable to determine
Generation Time: 8.09s
Predicted bbox: [300, 73, 384, 263]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: yellow flag next to the middle clownfish flag
Image: val2017/000000017959.jpg
Ground Truth Without CD With CD
Generation Time: 5.17s
Predicted bbox: [208, 264, 330, 403]
Ground truth: None
Result: Unable to determine
Generation Time: 8.53s
Predicted bbox: [22, 198, 103, 400]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: third motorcycle from the left
Image: val2017/000000019109.jpg
Ground Truth Without CD With CD
Generation Time: 4.62s
Predicted bbox: [187, 252, 324, 354]
Ground truth: [138.0, 261.0, 189.0, 375.0]
Result: Incorrect
IoU: 0.0095
Generation Time: 10.23s
Predicted bbox: [195, 253, 335, 357]
Ground truth: [138.0, 261.0, 189.0, 375.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the person next to the stairs
Image: val2017/000000018380.jpg
Ground Truth Without CD With CD
Generation Time: 3.98s
Predicted bbox: [236, 33, 286, 113]
Ground truth: [229.0, 36.0, 278.0, 120.0]
Result: Correct
IoU: 0.6624
Generation Time: 9.65s
Predicted bbox: [235, 34, 283, 133]
Ground truth: [229.0, 36.0, 278.0, 120.0]
Result: Correct
IoU: 0.6872
Dataset: refcocos_test
Caption: the person outside the fence who is not sitting
Image: val2017/000000018491.jpg
Ground Truth Without CD With CD
Generation Time: 6.25s
Predicted bbox: [133, 30, 166, 120]
Ground truth: [128.0, 32.0, 165.0, 145.0]
Result: Correct
IoU: 0.6496
Generation Time: 8.66s
Predicted bbox: [330, 72, 392, 200]
Ground truth: [128.0, 32.0, 165.0, 145.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the person on the surfboard which is not pink or yellow
Image: val2017/000000081988.jpg
Ground Truth Without CD With CD
Generation Time: 6.74s
Predicted bbox: [477, 310, 593, 394]
Ground truth: [45.0, 284.0, 160.0, 394.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 9.65s
Predicted bbox: [49, 281, 160, 386]
Ground truth: [45.0, 284.0, 160.0, 394.0]
Result: Correct
IoU: 0.8721
Dataset: refcocos_test
Caption: a burned hotdog
Image: val2017/000000083531.jpg
Ground Truth Without CD With CD
Generation Time: 3.79s
Predicted bbox: [344, 188, 412, 205]
Ground truth: [343.0, 159.0, 404.0, 178.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 8.47s
Predicted bbox: [305, 192, 486, 230]
Ground truth: [343.0, 159.0, 404.0, 178.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the three people on the balcony right above crowd, not on the ground
Image: val2017/000000084031.jpg
Ground Truth Without CD With CD
Generation Time: 4.33s
Predicted bbox: [122, 10, 644, 156]
Ground truth: [250.0, 126.0, 277.0, 152.0]
Result: Incorrect
IoU: 0.0092
Generation Time: 10.30s
Predicted bbox: [395, 13, 416, 28]
Ground truth: [250.0, 126.0, 277.0, 152.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the smaller pot in front of the cooking pan
Image: val2017/000000084241.jpg
Ground Truth Without CD With CD
Generation Time: 4.87s
Predicted bbox: [241, 323, 347, 392]
Ground truth: [238.0, 326.0, 321.0, 394.0]
Result: Correct
IoU: 0.6877
Generation Time: 10.95s
Predicted bbox: [238, 327, 339, 391]
Ground truth: [238.0, 326.0, 321.0, 394.0]
Result: Correct
IoU: 0.7816
Dataset: refcocos_test
Caption: person wearing striped shirt without long hair
Image: val2017/000000085157.jpg
Ground Truth Without CD With CD
Generation Time: 3.82s
Predicted bbox: [462, 93, 639, 438]
Ground truth: [422.0, 97.0, 630.0, 455.0]
Result: Correct
IoU: 0.7322
Generation Time: 12.70s
Predicted bbox: [456, 90, 641, 442]
Ground truth: [422.0, 97.0, 630.0, 455.0]
Result: Correct
IoU: 0.7546
Dataset: refcocos_test
Caption: the carbinets on the top of the microwave
Image: val2017/000000091615.jpg
Ground Truth Without CD With CD
Generation Time: 4.05s
Predicted bbox: [465, 0, 644, 120]
Ground truth: [479.0, 1.0, 626.0, 51.0]
Result: Incorrect
IoU: 0.3422
Generation Time: 9.67s
Predicted bbox: [473, 0, 644, 121]
Ground truth: [479.0, 1.0, 626.0, 51.0]
Result: Incorrect
IoU: 0.3552
Dataset: refcocos_test
Caption: hotdog without vegetables on it
Image: val2017/000000091779.jpg
Ground Truth Without CD With CD
Generation Time: 4.05s
Predicted bbox: [1, 46, 328, 164]
Ground truth: [110.0, 99.0, 472.0, 320.0]
Result: Incorrect
IoU: 0.1357
Generation Time: 7.76s
Predicted bbox: [0, 45, 331, 167]
Ground truth: [110.0, 99.0, 472.0, 320.0]
Result: Incorrect
IoU: 0.1426
Dataset: refcocos_test
Caption: dish seems to have the least amount
Image: val2017/000000092053.jpg
Ground Truth Without CD With CD
Generation Time: 6.02s
Predicted bbox: [373, 79, 644, 243]
Ground truth: [370.0, 84.0, 637.0, 248.0]
Result: Correct
IoU: 0.9075
Generation Time: 8.25s
Predicted bbox: [369, 80, 644, 245]
Ground truth: [370.0, 84.0, 637.0, 248.0]
Result: Correct
IoU: 0.9309
Dataset: refcocos_test
Caption: black board that does not have a number on it
Image: val2017/000000094185.jpg
Ground Truth Without CD With CD
Generation Time: 4.86s
Predicted bbox: [556, 173, 603, 355]
Ground truth: [551.0, 180.0, 604.0, 352.0]
Result: Correct
IoU: 0.8433
Generation Time: 10.25s
Predicted bbox: [562, 179, 617, 347]
Ground truth: [551.0, 180.0, 604.0, 352.0]
Result: Correct
IoU: 0.6184
Dataset: refcocos_test
Caption: person holding up a frisbee and not wearing a bag
Image: val2017/000000100238.jpg
Ground Truth Without CD With CD
Generation Time: 6.31s
Predicted bbox: [357, 0, 533, 476]
Ground truth: [8.0, 27.0, 207.0, 475.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 11.90s
Predicted bbox: [10, 27, 214, 476]
Ground truth: [8.0, 27.0, 207.0, 475.0]
Result: Correct
IoU: 0.9542
Dataset: refcocos_test
Caption: ice cream next to the potato
Image: val2017/000000104669.jpg
Ground Truth Without CD With CD
Generation Time: 3.75s
Predicted bbox: [292, 71, 345, 127]
Ground truth: None
Result: Unable to determine
Generation Time: 6.46s
Predicted bbox: [290, 73, 353, 127]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: the person partially obscured by the person in red shorts
Image: val2017/000000105264.jpg
Ground Truth Without CD With CD
Generation Time: 4.55s
Predicted bbox: [443, 195, 470, 291]
Ground truth: [437.0, 196.0, 469.0, 295.0]
Result: Correct
IoU: 0.7508
Generation Time: 8.46s
Predicted bbox: [445, 195, 473, 293]
Ground truth: [437.0, 196.0, 469.0, 295.0]
Result: Correct
IoU: 0.6496
Dataset: refcocos_test
Caption: the second car behind the car with two open doors
Image: val2017/000000111086.jpg
Ground Truth Without CD With CD
Generation Time: 4.00s
Predicted bbox: [164, 226, 294, 284]
Ground truth: [161.0, 232.0, 226.0, 281.0]
Result: Incorrect
IoU: 0.3952
Generation Time: 10.21s
Predicted bbox: [206, 221, 299, 285]
Ground truth: [161.0, 232.0, 226.0, 281.0]
Result: Incorrect
IoU: 0.1201
Dataset: refcocos_test
Caption: the suitcase own by a person holding food in hand
Image: val2017/000000114049.jpg
Ground Truth Without CD With CD
Generation Time: 4.40s
Predicted bbox: [162, 392, 295, 636]
Ground truth: [131.0, 327.0, 236.0, 571.0]
Result: Incorrect
IoU: 0.2955
Generation Time: 13.09s
Predicted bbox: [155, 353, 296, 637]
Ground truth: [131.0, 327.0, 236.0, 571.0]
Result: Incorrect
IoU: 0.3678
Dataset: refcocos_test
Caption: the bus next to the bus with a different color
Image: val2017/000000114884.jpg
Ground Truth Without CD With CD
Generation Time: 4.57s
Predicted bbox: [175, 75, 270, 138]
Ground truth: [215.0, 73.0, 273.0, 133.0]
Result: Correct
IoU: 0.5084
Generation Time: 8.17s
Predicted bbox: [162, 68, 271, 136]
Ground truth: [215.0, 73.0, 273.0, 133.0]
Result: Incorrect
IoU: 0.4461
Dataset: refcocos_test
Caption: woman with hat
Image: val2017/000000115870.jpg
Ground Truth Without CD With CD
Generation Time: 4.52s
Predicted bbox: [58, 178, 154, 294]
Ground truth: None
Result: Unable to determine
Generation Time: 9.23s
Predicted bbox: [274, 103, 336, 178]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: the elephant fifth farthest from the camera
Image: val2017/000000119641.jpg
Ground Truth Without CD With CD
Generation Time: 6.06s
Predicted bbox: [501, 376, 541, 433]
Ground truth: [502.0, 385.0, 537.0, 435.0]
Result: Correct
IoU: 0.7149
Generation Time: 13.60s
Predicted bbox: [499, 373, 540, 434]
Ground truth: [502.0, 385.0, 537.0, 435.0]
Result: Correct
IoU: 0.6763
Dataset: refcocos_test
Caption: horse at left rear of the horse ride by a man wearing shirt
Image: val2017/000000121031.jpg
Ground Truth Without CD With CD
Generation Time: 6.49s
Predicted bbox: [220, 190, 295, 305]
Ground truth: [387.0, 187.0, 442.0, 260.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 13.96s
Predicted bbox: [225, 190, 294, 313]
Ground truth: [387.0, 187.0, 442.0, 260.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: person in yellow jersy
Image: val2017/000000123213.jpg
Ground Truth Without CD With CD
Generation Time: 3.28s
Predicted bbox: [376, 0, 460, 80]
Ground truth: None
Result: Unable to determine
Generation Time: 8.77s
Predicted bbox: [367, 0, 459, 84]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: the doll in front of a book whose name is not the office and not monk
Image: val2017/000000125062.jpg
Ground Truth Without CD With CD
Generation Time: 7.51s
Predicted bbox: [148, 320, 417, 627]
Ground truth: [1.0, 224.0, 127.0, 445.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 12.96s
Predicted bbox: [147, 326, 416, 625]
Ground truth: [1.0, 224.0, 127.0, 445.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: chips that is neither red nor green
Image: val2017/000000125936.jpg
Ground Truth Without CD With CD
Generation Time: 4.70s
Predicted bbox: [218, 149, 302, 201]
Ground truth: [255.0, 107.0, 316.0, 136.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 8.54s
Predicted bbox: [234, 173, 299, 200]
Ground truth: [255.0, 107.0, 316.0, 136.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the cabinet above the white rice cooker
Image: val2017/000000127182.jpg
Ground Truth Without CD With CD
Generation Time: 7.23s
Predicted bbox: [178, 120, 327, 276]
Ground truth: [187.0, 64.0, 325.0, 275.0]
Result: Correct
IoU: 0.6906
Generation Time: 12.23s
Predicted bbox: [178, 67, 318, 273]
Ground truth: [187.0, 64.0, 325.0, 275.0]
Result: Correct
IoU: 0.8713
Dataset: refcocos_test
Caption: the surfboard overlapping two other surfboards
Image: val2017/000000127517.jpg
Ground Truth Without CD With CD
Generation Time: 3.62s
Predicted bbox: [228, 0, 371, 408]
Ground truth: [507.0, 75.0, 578.0, 363.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 9.31s
Predicted bbox: [228, 0, 374, 403]
Ground truth: [507.0, 75.0, 578.0, 363.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: object behind the couch not facing camera horizontally
Image: val2017/000000128148.jpg
Ground Truth Without CD With CD
Generation Time: 4.28s
Predicted bbox: [76, 112, 154, 220]
Ground truth: [1.0, 189.0, 96.0, 311.0]
Result: Incorrect
IoU: 0.0320
Generation Time: 10.59s
Predicted bbox: [75, 113, 155, 223]
Ground truth: [1.0, 189.0, 96.0, 311.0]
Result: Incorrect
IoU: 0.0363
Dataset: refcocos_test
Caption: the cake decorated with two white swan-like figures, noticeably further apart from each other compared to similar decorations on other cakes
Image: val2017/000000128476.jpg
Ground Truth Without CD With CD
Generation Time: 7.06s
Predicted bbox: [332, 124, 591, 278]
Ground truth: [307.0, 146.0, 593.0, 353.0]
Result: Correct
IoU: 0.5268
Generation Time: 13.61s
Predicted bbox: [317, 126, 588, 331]
Ground truth: [307.0, 146.0, 593.0, 353.0]
Result: Correct
IoU: 0.7758
Dataset: refcocos_test
Caption: the cow furthest from camera
Image: val2017/000000129416.jpg
Ground Truth Without CD With CD
Generation Time: 4.70s
Predicted bbox: [36, 206, 58, 238]
Ground truth: [57.0, 214.0, 74.0, 235.0]
Result: Incorrect
IoU: 0.0202
Generation Time: 8.35s
Predicted bbox: [33, 206, 56, 237]
Ground truth: [57.0, 214.0, 74.0, 235.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: man sitting next to number 25 with his mouth open
Image: val2017/000000133969.jpg
Ground Truth Without CD With CD
Generation Time: 4.85s
Predicted bbox: [262, 109, 328, 312]
Ground truth: [214.0, 176.0, 288.0, 315.0]
Result: Incorrect
IoU: 0.1755
Generation Time: 10.53s
Predicted bbox: [220, 172, 283, 312]
Ground truth: [214.0, 176.0, 288.0, 315.0]
Result: Correct
IoU: 0.8131
Dataset: refcocos_test
Caption: keyboard closest to monitor that is on
Image: val2017/000000135872.jpg
Ground Truth Without CD With CD
Generation Time: 5.17s
Predicted bbox: [312, 167, 408, 207]
Ground truth: [310.0, 166.0, 369.0, 198.0]
Result: Incorrect
IoU: 0.4461
Generation Time: 10.72s
Predicted bbox: [311, 166, 409, 206]
Ground truth: [310.0, 166.0, 369.0, 198.0]
Result: Incorrect
IoU: 0.4696
Dataset: refcocos_test
Caption: cow closest to the one sticking out tongue and doesn't have brown skin
Image: val2017/000000137576.jpg
Ground Truth Without CD With CD
Generation Time: 5.28s
Predicted bbox: [38, 180, 240, 367]
Ground truth: [0.0, 304.0, 121.0, 489.0]
Result: Incorrect
IoU: 0.0952
Generation Time: 11.19s
Predicted bbox: [37, 180, 240, 355]
Ground truth: [0.0, 304.0, 121.0, 489.0]
Result: Incorrect
IoU: 0.0799
Dataset: refcocos_test
Caption: the watermelon behind the one that is being held
Image: val2017/000000139099.jpg
Ground Truth Without CD With CD
Generation Time: 4.39s
Predicted bbox: [59, 348, 205, 401]
Ground truth: [43.0, 391.0, 182.0, 411.0]
Result: Incorrect
IoU: 0.1324
Generation Time: 12.99s
Predicted bbox: [64, 345, 211, 404]
Ground truth: [43.0, 391.0, 182.0, 411.0]
Result: Incorrect
IoU: 0.1547
Dataset: refcocos_test
Caption: third biggest decoration on left wall
Image: val2017/000000139684.jpg
Ground Truth Without CD With CD
Generation Time: 3.83s
Predicted bbox: [86, 15, 111, 68]
Ground truth: [86.0, 18.0, 110.0, 70.0]
Result: Correct
IoU: 0.8740
Generation Time: 14.42s
Predicted bbox: [84, 17, 111, 70]
Ground truth: [86.0, 18.0, 110.0, 70.0]
Result: Correct
IoU: 0.8721
Dataset: refcocos_test
Caption: object under the wrench
Image: val2017/000000140556.jpg
Ground Truth Without CD With CD
Generation Time: 4.00s
Predicted bbox: [313, 233, 497, 448]
Ground truth: [389.0, 244.0, 487.0, 456.0]
Result: Incorrect
IoU: 0.4955
Generation Time: 9.41s
Predicted bbox: [327, 237, 497, 442]
Ground truth: [389.0, 244.0, 487.0, 456.0]
Result: Correct
IoU: 0.5357
Dataset: refcocos_test
Caption: object being cut by lady in middle
Image: val2017/000000140640.jpg
Ground Truth Without CD With CD
Generation Time: 3.62s
Predicted bbox: [426, 356, 591, 415]
Ground truth: [463.0, 367.0, 593.0, 423.0]
Result: Correct
IoU: 0.5652
Generation Time: 8.24s
Predicted bbox: [433, 355, 593, 414]
Ground truth: [463.0, 367.0, 593.0, 423.0]
Result: Correct
IoU: 0.5759
Dataset: refcocos_test
Caption: the kite on the left of english flag
Image: val2017/000000140840.jpg
Ground Truth Without CD With CD
Generation Time: 3.90s
Predicted bbox: [218, 169, 288, 251]
Ground truth: [138.0, 175.0, 199.0, 238.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 12.00s
Predicted bbox: [212, 164, 292, 248]
Ground truth: [138.0, 175.0, 199.0, 238.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the person not on the same side as man with hat
Image: val2017/000000115870.jpg
Ground Truth Without CD With CD
Generation Time: 6.07s
Predicted bbox: [189, 240, 623, 416]
Ground truth: [273.0, 103.0, 333.0, 181.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 12.86s
Predicted bbox: [26, 181, 109, 296]
Ground truth: [273.0, 103.0, 333.0, 181.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: person who is on a bicycle but not riding it
Image: val2017/000000142324.jpg
Ground Truth Without CD With CD
Generation Time: 4.65s
Predicted bbox: [199, 170, 262, 333]
Ground truth: [284.0, 194.0, 330.0, 291.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 9.67s
Predicted bbox: [203, 170, 261, 338]
Ground truth: [284.0, 194.0, 330.0, 291.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the cap of the portable stove
Image: val2017/000000142620.jpg
Ground Truth Without CD With CD
Generation Time: 3.90s
Predicted bbox: [75, 289, 121, 352]
Ground truth: [41.0, 369.0, 101.0, 422.0]
Result: Incorrect
IoU: 0.0000
Generation Time: 7.95s
Predicted bbox: [76, 293, 121, 354]
Ground truth: [41.0, 369.0, 101.0, 422.0]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the man the woman with a translucent veil looking at
Image: val2017/000000143961.jpg
Ground Truth Without CD With CD
Generation Time: 4.98s
Predicted bbox: [0, 145, 73, 365]
Ground truth: [0.0, 138.0, 73.0, 325.0]
Result: Correct
IoU: 0.7930
Generation Time: 12.53s
Predicted bbox: [1, 144, 125, 352]
Ground truth: [0.0, 138.0, 73.0, 325.0]
Result: Incorrect
IoU: 0.4934
Dataset: refcocos_test
Caption: person sitting at 3 o'clock position on picnic mat
Image: val2017/000000145597.jpg
Ground Truth Without CD With CD
Generation Time: 5.01s
Predicted bbox: [356, 125, 630, 476]
Ground truth: [480.0, 35.0, 639.0, 256.0]
Result: Incorrect
IoU: 0.1760
Generation Time: 10.05s
Predicted bbox: [555, 32, 644, 265]
Ground truth: [480.0, 35.0, 639.0, 256.0]
Result: Incorrect
IoU: 0.4975
Dataset: refcocos_test
Caption: a bowl whose exterior is neither red nor white
Image: val2017/000000494869.jpg
Ground Truth Without CD With CD
Generation Time: 4.33s
Predicted bbox: [337, 236, 377, 266]
Ground truth: [342.29, 236.85, 392.62, 265.01]
Result: Correct
IoU: 0.5960
Generation Time: 8.46s
Predicted bbox: [334, 235, 381, 264]
Ground truth: [342.29, 236.85, 392.62, 265.01]
Result: Correct
IoU: 0.6077
Dataset: refcocos_test
Caption: the person wearing sneakers that are not blue
Image: val2017/000000554002.jpg
Ground Truth Without CD With CD
Generation Time: 4.62s
Predicted bbox: [16, 0, 113, 252]
Ground truth: [19.14, 2.39, 109.12, 257.97]
Result: Correct
IoU: 0.8991
Generation Time: 9.02s
Predicted bbox: [21, 1, 116, 250]
Ground truth: [19.14, 2.39, 109.12, 257.97]
Result: Correct
IoU: 0.8787
Dataset: refcocos_test
Caption: the car has a cat wearing a red scarf around its neck
Image: val2017/000000078823.jpg
Ground Truth Without CD With CD
Generation Time: 4.25s
Predicted bbox: [197, 118, 366, 339]
Ground truth: None
Result: Unable to determine
Generation Time: 8.33s
Predicted bbox: [199, 118, 364, 336]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: the knife that is neither held by anyone nor placed on the marble surface
Image: val2017/000000419974.jpg
Ground Truth Without CD With CD
Generation Time: 5.82s
Predicted bbox: [190, 461, 338, 528]
Ground truth: [130.09, 276.33, 146.09, 283.4]
Result: Incorrect
IoU: 0.0000
Generation Time: 13.88s
Predicted bbox: [189, 457, 343, 526]
Ground truth: [130.09, 276.33, 146.09, 283.4]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the bicycle being ridden by a person holding a dog
Image: val2017/000000424162.jpg
Ground Truth Without CD With CD
Generation Time: 5.36s
Predicted bbox: [318, 209, 451, 473]
Ground truth: [305.56, 230.39, 422.68, 474.33]
Result: Correct
IoU: 0.6633
Generation Time: 11.86s
Predicted bbox: [329, 228, 452, 471]
Ground truth: [305.56, 230.39, 422.68, 474.33]
Result: Correct
IoU: 0.6275
Dataset: refcocos_test
Caption: the cup mounted on the wall, located in the second row from the top, at the leftmost position
Image: val2017/000000329219.jpg
Ground Truth Without CD With CD
Generation Time: 7.28s
Predicted bbox: [105, 65, 132, 99]
Ground truth: [331.4, 80.38, 346.26, 97.17]
Result: Incorrect
IoU: 0.0000
Generation Time: 14.31s
Predicted bbox: [79, 56, 105, 98]
Ground truth: [331.4, 80.38, 346.26, 97.17]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the person in the car who is not sitting in the driver's seat
Image: val2017/000000067213.jpg
Ground Truth Without CD With CD
Generation Time: 4.62s
Predicted bbox: [365, 362, 402, 402]
Ground truth: [277.98, 371.09, 310.22, 409.37]
Result: Incorrect
IoU: 0.0000
Generation Time: 11.60s
Predicted bbox: [361, 358, 402, 402]
Ground truth: [277.98, 371.09, 310.22, 409.37]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the bench that a dog is sitting on
Image: val2017/000000061108.jpg
Ground Truth Without CD With CD
Generation Time: 4.76s
Predicted bbox: [0, 50, 275, 377]
Ground truth: None
Result: Unable to determine
Generation Time: 13.24s
Predicted bbox: [1, 44, 275, 335]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: the car located to the left of the car containing the dog
Image: val2017/000000365207.jpg
Ground Truth Without CD With CD
Generation Time: 4.85s
Predicted bbox: [55, 245, 206, 464]
Ground truth: [69.7, 260.42, 211.63, 463.18]
Result: Correct
IoU: 0.8078
Generation Time: 13.60s
Predicted bbox: [65, 245, 211, 461]
Ground truth: [69.7, 260.42, 211.63, 463.18]
Result: Correct
IoU: 0.8865
Dataset: refcocos_test
Caption: the bicycle in the background positioned between the person wearing a black shirt and white pants and the person wearing a black-and-white patterned shirt and shorts, mostly obscured by other objects
Image: val2017/000000279278.jpg
Ground Truth Without CD With CD
Generation Time: 5.68s
Predicted bbox: [466, 26, 610, 142]
Ground truth: [334.76, 48.96, 365.2, 157.71]
Result: Incorrect
IoU: 0.0000
Generation Time: 13.43s
Predicted bbox: [448, 38, 533, 148]
Ground truth: [334.76, 48.96, 365.2, 157.71]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the second hanging potted plant from the right
Image: val2017/000000482100.jpg
Ground Truth Without CD With CD
Generation Time: 4.33s
Predicted bbox: [335, 0, 370, 50]
Ground truth: [338.75, 0, 379.54, 39.27]
Result: Correct
IoU: 0.5776
Generation Time: 11.43s
Predicted bbox: [337, 1, 373, 42]
Ground truth: [338.75, 0, 379.54, 39.27]
Result: Correct
IoU: 0.7418
Dataset: refcocos_test
Caption: a watermelon in a bowl placed centrally on the wooden countertop island
Image: val2017/000000540502.jpg
Ground Truth Without CD With CD
Generation Time: 4.90s
Predicted bbox: [334, 198, 381, 227]
Ground truth: None
Result: Unable to determine
Generation Time: 11.45s
Predicted bbox: [337, 203, 385, 227]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: the second knife from the top positioned in a knife block
Image: val2017/000000127182.jpg
Ground Truth Without CD With CD
Generation Time: 4.46s
Predicted bbox: [28, 339, 41, 385]
Ground truth: [7.8, 342.05, 37.51, 371.61]
Result: Incorrect
IoU: 0.2352
Generation Time: 12.66s
Predicted bbox: [1, 338, 42, 395]
Ground truth: [7.8, 342.05, 37.51, 371.61]
Result: Incorrect
IoU: 0.3758
Dataset: refcocos_test
Caption: a red bowl that is not located on the top shelf of the right set of cabinets
Image: val2017/000000575970.jpg
Ground Truth Without CD With CD
Generation Time: 4.74s
Predicted bbox: [486, 108, 510, 124]
Ground truth: [276.5, 83.81, 296.74, 90.16]
Result: Incorrect
IoU: 0.0000
Generation Time: 12.46s
Predicted bbox: [489, 109, 512, 123]
Ground truth: [276.5, 83.81, 296.74, 90.16]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: a wine glass located on top of the stove
Image: val2017/000000226984.jpg
Ground Truth Without CD With CD
Generation Time: 4.26s
Predicted bbox: [393, 191, 415, 213]
Ground truth: None
Result: Unable to determine
Generation Time: 7.96s
Predicted bbox: [38, 218, 61, 296]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: the chair close to the fruit and not next to the refrigerator
Image: val2017/000000037777.jpg
Ground Truth Without CD With CD
Generation Time: 6.08s
Predicted bbox: [0, 198, 97, 252]
Ground truth: [116.5, 189.57, 166.5, 215.07]
Result: Incorrect
IoU: 0.0000
Generation Time: 14.21s
Predicted bbox: [191, 168, 252, 252]
Ground truth: [116.5, 189.57, 166.5, 215.07]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the plant located between a yellow bottle and a blue bottle
Image: val2017/000000491216.jpg
Ground Truth Without CD With CD
Generation Time: 4.38s
Predicted bbox: [270, 185, 301, 245]
Ground truth: [269.55, 180.58, 298.97, 243.18]
Result: Correct
IoU: 0.8360
Generation Time: 9.82s
Predicted bbox: [269, 185, 304, 245]
Ground truth: [269.55, 180.58, 298.97, 243.18]
Result: Correct
IoU: 0.7675
Dataset: refcocos_test
Caption: the plant that is neither hanging nor placed on a kitchen table
Image: val2017/000000136355.jpg
Ground Truth Without CD With CD
Generation Time: 4.69s
Predicted bbox: [455, 173, 534, 271]
Ground truth: [448.77, 175.76, 513.22, 298.76]
Result: Correct
IoU: 0.5477
Generation Time: 9.74s
Predicted bbox: [455, 173, 530, 277]
Ground truth: [448.77, 175.76, 513.22, 298.76]
Result: Correct
IoU: 0.5994
Dataset: refcocos_test
Caption: a cup on the middle shelf of the left wall, surrounded by wine glasses
Image: val2017/000000529568.jpg
Ground Truth Without CD With CD
Generation Time: 6.09s
Predicted bbox: [20, 221, 130, 257]
Ground truth: [66.71, 220.26, 82.78999999999999, 253.76999999999998]
Result: Incorrect
IoU: 0.1327
Generation Time: 13.18s
Predicted bbox: [67, 283, 127, 314]
Ground truth: [66.71, 220.26, 82.78999999999999, 253.76999999999998]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: a person wiping her face with a towel
Image: val2017/000000306733.jpg
Ground Truth Without CD With CD
Generation Time: 4.33s
Predicted bbox: [163, 172, 224, 303]
Ground truth: None
Result: Unable to determine
Generation Time: 8.81s
Predicted bbox: [164, 173, 227, 305]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: the smaller bowl that is yellow
Image: val2017/000000068833.jpg
Ground Truth Without CD With CD
Generation Time: 3.94s
Predicted bbox: [315, 224, 337, 244]
Ground truth: [313.8, 228.19, 335.29, 247.14]
Result: Correct
IoU: 0.6093
Generation Time: 10.32s
Predicted bbox: [316, 223, 337, 245]
Ground truth: [313.8, 228.19, 335.29, 247.14]
Result: Correct
IoU: 0.5950
Dataset: refcocos_test
Caption: the figure of a person that has a solid-colored background that is not white
Image: val2017/000000149222.jpg
Ground Truth Without CD With CD
Generation Time: 4.36s
Predicted bbox: [168, 84, 206, 112]
Ground truth: [236.11, 72.81, 248.48000000000002, 89.11]
Result: Incorrect
IoU: 0.0000
Generation Time: 8.74s
Predicted bbox: [139, 82, 205, 116]
Ground truth: [236.11, 72.81, 248.48000000000002, 89.11]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: a display screen that is not showing any content with red color
Image: val2017/000000361586.jpg
Ground Truth Without CD With CD
Generation Time: 5.67s
Predicted bbox: [442, 75, 493, 127]
Ground truth: [19.67, 133.25, 86.2, 211.82]
Result: Incorrect
IoU: 0.0000
Generation Time: 9.28s
Predicted bbox: [25, 137, 104, 213]
Ground truth: [19.67, 133.25, 86.2, 211.82]
Result: Correct
IoU: 0.6883
Dataset: refcocos_test
Caption: the bottle that is not empty and is located on the right side of the flower
Image: val2017/000000186632.jpg
Ground Truth Without CD With CD
Generation Time: 5.37s
Predicted bbox: [396, 387, 439, 443]
Ground truth: None
Result: Unable to determine
Generation Time: 12.89s
Predicted bbox: [398, 385, 438, 440]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: the third chair from the left at the dining table
Image: val2017/000000440475.jpg
Ground Truth Without CD With CD
Generation Time: 6.11s
Predicted bbox: [297, 287, 416, 385]
Ground truth: [444.5, 299.5, 542.71, 361.2]
Result: Incorrect
IoU: 0.0000
Generation Time: 10.59s
Predicted bbox: [288, 291, 416, 391]
Ground truth: [444.5, 299.5, 542.71, 361.2]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: an orange cat sitting on the carpet watching tv
Image: val2017/000000240940.jpg
Ground Truth Without CD With CD
Generation Time: 4.64s
Predicted bbox: [115, 310, 214, 504]
Ground truth: None
Result: Unable to determine
Generation Time: 8.87s
Predicted bbox: [116, 310, 210, 504]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: the second bicycle that is laying on top of the motorcycle
Image: val2017/000000070774.jpg
Ground Truth Without CD With CD
Generation Time: 4.40s
Predicted bbox: [295, 171, 506, 275]
Ground truth: [261.38, 173.6, 506.79999999999995, 223.64999999999998]
Result: Incorrect
IoU: 0.4462
Generation Time: 9.75s
Predicted bbox: [277, 154, 505, 235]
Ground truth: [261.38, 173.6, 506.79999999999995, 223.64999999999998]
Result: Correct
IoU: 0.5900
Dataset: refcocos_test
Caption: the white pigeon burying its head inside the bread
Image: val2017/000000123585.jpg
Ground Truth Without CD With CD
Generation Time: 4.79s
Predicted bbox: [205, 205, 305, 443]
Ground truth: None
Result: Unable to determine
Generation Time: 12.97s
Predicted bbox: [209, 205, 301, 388]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: the car that is neither blue nor on the left side of the road and does not have a cat on it
Image: val2017/000000466156.jpg
Ground Truth Without CD With CD
Generation Time: 5.23s
Predicted bbox: [274, 36, 319, 53]
Ground truth: [274.28, 32.03, 291.21, 40.57]
Result: Incorrect
IoU: 0.0930
Generation Time: 12.84s
Predicted bbox: [246, 29, 271, 44]
Ground truth: [274.28, 32.03, 291.21, 40.57]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the car whose license plate number begins with a digit other than one
Image: val2017/000000172330.jpg
Ground Truth Without CD With CD
Generation Time: 8.51s
Predicted bbox: [475, 82, 644, 371]
Ground truth: [471.74, 79.74, 637.94, 384.26]
Result: Correct
IoU: 0.8993
Generation Time: 10.87s
Predicted bbox: [477, 79, 644, 362]
Ground truth: [471.74, 79.74, 637.94, 384.26]
Result: Correct
IoU: 0.8662
Dataset: refcocos_test
Caption: the second cup next to the red tube
Image: val2017/000000227044.jpg
Ground Truth Without CD With CD
Generation Time: 5.04s
Predicted bbox: [197, 0, 261, 39]
Ground truth: [114.34, 0, 174.74, 43.15]
Result: Incorrect
IoU: 0.0000
Generation Time: 9.32s
Predicted bbox: [196, 0, 262, 38]
Ground truth: [114.34, 0, 174.74, 43.15]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the person who is wearing green clothing and is next to the woman wearing a purple shirt
Image: val2017/000000176857.jpg
Ground Truth Without CD With CD
Generation Time: 6.36s
Predicted bbox: [146, 10, 176, 72]
Ground truth: [145.04, 11.4, 175.32, 75.97]
Result: Correct
IoU: 0.8717
Generation Time: 10.99s
Predicted bbox: [151, 14, 176, 76]
Ground truth: [145.04, 11.4, 175.32, 75.97]
Result: Correct
IoU: 0.7543
Dataset: refcocos_test
Caption: the horse that is not brown and is facing away from the car
Image: val2017/000000017178.jpg
Ground Truth Without CD With CD
Generation Time: 6.64s
Predicted bbox: [372, 167, 436, 262]
Ground truth: [374.97, 173.6, 433.33000000000004, 267.26]
Result: Correct
IoU: 0.8077
Generation Time: 10.11s
Predicted bbox: [458, 165, 515, 206]
Ground truth: [374.97, 173.6, 433.33000000000004, 267.26]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the silver car that is on the front left side of the horse
Image: val2017/000000368335.jpg
Ground Truth Without CD With CD
Generation Time: 4.90s
Predicted bbox: [58, 187, 162, 317]
Ground truth: [75.52, 209.12, 165.68, 326.19]
Result: Correct
IoU: 0.6327
Generation Time: 13.52s
Predicted bbox: [44, 187, 166, 324]
Ground truth: [75.52, 209.12, 165.68, 326.19]
Result: Correct
IoU: 0.6125
Dataset: refcocos_test
Caption: the person wearing a blue shirt walking behind the blue and white bus
Image: val2017/000000367680.jpg
Ground Truth Without CD With CD
Generation Time: 4.44s
Predicted bbox: [202, 153, 214, 180]
Ground truth: [236.2, 150.47, 250.04999999999998, 197.09]
Result: Incorrect
IoU: 0.0000
Generation Time: 10.38s
Predicted bbox: [234, 150, 249, 193]
Ground truth: [236.2, 150.47, 250.04999999999998, 197.09]
Result: Correct
IoU: 0.7294
Dataset: refcocos_test
Caption: the horse that is not facing the camera and does not have a white tail
Image: val2017/000000234807.jpg
Ground Truth Without CD With CD
Generation Time: 4.72s
Predicted bbox: [133, 214, 220, 399]
Ground truth: [3.17, 238.5, 86.94, 291.1]
Result: Incorrect
IoU: 0.0000
Generation Time: 13.11s
Predicted bbox: [131, 215, 217, 395]
Ground truth: [3.17, 238.5, 86.94, 291.1]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the rider that is not wearing red or black helmet
Image: val2017/000000507975.jpg
Ground Truth Without CD With CD
Generation Time: 4.26s
Predicted bbox: [320, 24, 456, 145]
Ground truth: [361.96, 23.07, 451.36, 101.44]
Result: Incorrect
IoU: 0.4186
Generation Time: 10.30s
Predicted bbox: [346, 26, 455, 124]
Ground truth: [361.96, 23.07, 451.36, 101.44]
Result: Correct
IoU: 0.6163
Dataset: refcocos_test
Caption: the person who is holding a camera and carrying a green bag
Image: val2017/000000338304.jpg
Ground Truth Without CD With CD
Generation Time: 3.94s
Predicted bbox: [1, 306, 91, 604]
Ground truth: [0.42, 298.6, 101.52, 488.25]
Result: Correct
IoU: 0.5543
Generation Time: 10.44s
Predicted bbox: [13, 308, 105, 490]
Ground truth: [0.42, 298.6, 101.52, 488.25]
Result: Correct
IoU: 0.7993
Dataset: refcocos_test
Caption: the traffic light with an arrow that is not pointing to the right
Image: val2017/000000555050.jpg
Ground Truth Without CD With CD
Generation Time: 4.73s
Predicted bbox: [2, 52, 42, 157]
Ground truth: [5.1, 50.72, 36.13, 132.82999999999998]
Result: Correct
IoU: 0.5916
Generation Time: 9.64s
Predicted bbox: [0, 50, 40, 147]
Ground truth: [5.1, 50.72, 36.13, 132.82999999999998]
Result: Correct
IoU: 0.6567
Dataset: refcocos_test
Caption: the second cow next to the cow with the least amount of brown
Image: val2017/000000206135.jpg
Ground Truth Without CD With CD
Generation Time: 6.43s
Predicted bbox: [172, 303, 235, 442]
Ground truth: [172.18, 302.22, 233.89000000000001, 439.78000000000003]
Result: Correct
IoU: 0.9586
Generation Time: 11.41s
Predicted bbox: [178, 302, 237, 440]
Ground truth: [172.18, 302.22, 233.89000000000001, 439.78000000000003]
Result: Correct
IoU: 0.8597
Dataset: refcocos_test
Caption: the bottle with a white top that is closest to the red bottle
Image: val2017/000000465129.jpg
Ground Truth Without CD With CD
Generation Time: 5.25s
Predicted bbox: [546, 328, 563, 365]
Ground truth: [543.74, 335.07, 560.86, 366.53]
Result: Correct
IoU: 0.6153
Generation Time: 10.95s
Predicted bbox: [550, 325, 565, 362]
Ground truth: [543.74, 335.07, 560.86, 366.53]
Result: Incorrect
IoU: 0.3651
Dataset: refcocos_test
Caption: the bottle that is not foil-wrapped and is located on the first shelf from the top
Image: val2017/000000506310.jpg
Ground Truth Without CD With CD
Generation Time: 6.67s
Predicted bbox: [34, 87, 92, 218]
Ground truth: [1.46, 76.32, 39.76, 241.29]
Result: Incorrect
IoU: 0.0573
Generation Time: 12.32s
Predicted bbox: [0, 78, 38, 233]
Ground truth: [1.46, 76.32, 39.76, 241.29]
Result: Correct
IoU: 0.8654
Dataset: refcocos_test
Caption: the spinning chair that is closest to the wine bottle
Image: val2017/000000519569.jpg
Ground Truth Without CD With CD
Generation Time: 4.30s
Predicted bbox: [121, 392, 245, 619]
Ground truth: [126.4, 391.5, 248.52, 615.54]
Result: Correct
IoU: 0.9143
Generation Time: 13.87s
Predicted bbox: [124, 394, 243, 620]
Ground truth: [126.4, 391.5, 248.52, 615.54]
Result: Correct
IoU: 0.9089
Dataset: refcocos_test
Caption: the man who is blow drying his hair using the hair drier
Image: val2017/000000178028.jpg
Ground Truth Without CD With CD
Generation Time: 3.90s
Predicted bbox: [177, 2, 246, 86]
Ground truth: None
Result: Unable to determine
Generation Time: 7.25s
Predicted bbox: [177, 1, 237, 89]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: the cup on the counter that is mostly covered
Image: val2017/000000290768.jpg
Ground Truth Without CD With CD
Generation Time: 4.81s
Predicted bbox: [83, 192, 146, 267]
Ground truth: [152.34, 189.47, 170.51, 258.27]
Result: Incorrect
IoU: 0.0000
Generation Time: 9.59s
Predicted bbox: [83, 195, 151, 269]
Ground truth: [152.34, 189.47, 170.51, 258.27]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the ceramic bowl that is empty
Image: val2017/000000182611.jpg
Ground Truth Without CD With CD
Generation Time: 3.70s
Predicted bbox: [316, 553, 416, 644]
Ground truth: [136.4, 537.44, 185.67000000000002, 581.34]
Result: Incorrect
IoU: 0.0000
Generation Time: 8.32s
Predicted bbox: [318, 560, 414, 644]
Ground truth: [136.4, 537.44, 185.67000000000002, 581.34]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the person that is not wearing a uniform and is blocked by the person who is wearing a hat
Image: val2017/000000228214.jpg
Ground Truth Without CD With CD
Generation Time: 5.83s
Predicted bbox: [304, 474, 420, 637]
Ground truth: [287.84, 475.7, 406.40999999999997, 640]
Result: Correct
IoU: 0.7553
Generation Time: 10.39s
Predicted bbox: [292, 469, 420, 644]
Ground truth: [287.84, 475.7, 406.40999999999997, 640]
Result: Correct
IoU: 0.8143
Dataset: refcocos_test
Caption: the first toothbrush from the right side that is not blue
Image: val2017/000000293390.jpg
Ground Truth Without CD With CD
Generation Time: 4.76s
Predicted bbox: [466, 2, 474, 53]
Ground truth: [494.62, 11.04, 502.18, 49.46]
Result: Incorrect
IoU: 0.0000
Generation Time: 13.57s
Predicted bbox: [467, 10, 476, 47]
Ground truth: [494.62, 11.04, 502.18, 49.46]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the smallest bottle without a blue or green cap
Image: val2017/000000384808.jpg
Ground Truth Without CD With CD
Generation Time: 6.97s
Predicted bbox: [46, 264, 63, 329]
Ground truth: [48.91, 268.67, 63.449999999999996, 325.36]
Result: Correct
IoU: 0.7065
Generation Time: 15.20s
Predicted bbox: [44, 261, 63, 327]
Ground truth: [48.91, 268.67, 63.449999999999996, 325.36]
Result: Correct
IoU: 0.6243
Dataset: refcocos_test
Caption: the bottle that is not in the refrigerator and has blue writing on its label
Image: val2017/000000425226.jpg
Ground Truth Without CD With CD
Generation Time: 6.71s
Predicted bbox: [283, 350, 308, 396]
Ground truth: [299.65, 2.09, 321.41999999999996, 44.269999999999996]
Result: Incorrect
IoU: 0.0000
Generation Time: 11.98s
Predicted bbox: [301, 0, 322, 46]
Ground truth: [299.65, 2.09, 321.41999999999996, 44.269999999999996]
Result: Correct
IoU: 0.8420
Dataset: refcocos_test
Caption: the bottle that is neither green nor has a rectangular cap
Image: val2017/000000292005.jpg
Ground Truth Without CD With CD
Generation Time: 4.90s
Predicted bbox: [212, 455, 235, 509]
Ground truth: [201.81, 453.48, 220.65, 508.34000000000003]
Result: Incorrect
IoU: 0.2543
Generation Time: 12.32s
Predicted bbox: [210, 456, 235, 507]
Ground truth: [201.81, 453.48, 220.65, 508.34000000000003]
Result: Incorrect
IoU: 0.3077
Dataset: refcocos_test
Caption: the chair close to the stove and partially covered by the banana
Image: val2017/000000480122.jpg
Ground Truth Without CD With CD
Generation Time: 4.82s
Predicted bbox: [69, 398, 192, 481]
Ground truth: [217.57, 359.3, 294.26, 430.21000000000004]
Result: Incorrect
IoU: 0.0000
Generation Time: 12.25s
Predicted bbox: [65, 398, 192, 482]
Ground truth: [217.57, 359.3, 294.26, 430.21000000000004]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the partially empty spray bottle with green liquid
Image: val2017/000000197796.jpg
Ground Truth Without CD With CD
Generation Time: 4.96s
Predicted bbox: [314, 56, 340, 154]
Ground truth: [312.48, 57.63, 337.43, 155.96]
Result: Correct
IoU: 0.8231
Generation Time: 8.34s
Predicted bbox: [313, 54, 341, 154]
Ground truth: [312.48, 57.63, 337.43, 155.96]
Result: Correct
IoU: 0.8121
Dataset: refcocos_test
Caption: the pink cup on the second shelf from the top
Image: val2017/000000481386.jpg
Ground Truth Without CD With CD
Generation Time: 4.54s
Predicted bbox: [306, 142, 331, 158]
Ground truth: [287.43, 135.07, 311.46000000000004, 160.73]
Result: Incorrect
IoU: 0.0940
Generation Time: 8.95s
Predicted bbox: [344, 120, 374, 159]
Ground truth: [287.43, 135.07, 311.46000000000004, 160.73]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the white ceramic bowl that is not on the counter
Image: val2017/000000397133.jpg
Ground Truth Without CD With CD
Generation Time: 4.58s
Predicted bbox: [32, 337, 102, 381]
Ground truth: [157.2, 114.15, 175.06, 129.97]
Result: Incorrect
IoU: 0.0000
Generation Time: 9.39s
Predicted bbox: [31, 339, 101, 381]
Ground truth: [157.2, 114.15, 175.06, 129.97]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the red bottle that is not located on the first shelf from the top
Image: val2017/000000173302.jpg
Ground Truth Without CD With CD
Generation Time: 5.15s
Predicted bbox: [110, 170, 123, 187]
Ground truth: [435.32, 178.03, 442.81, 191.89]
Result: Incorrect
IoU: 0.0000
Generation Time: 10.18s
Predicted bbox: [109, 170, 123, 187]
Ground truth: [435.32, 178.03, 442.81, 191.89]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the tall bottle that is closest to the stove
Image: val2017/000000523100.jpg
Ground Truth Without CD With CD
Generation Time: 4.51s
Predicted bbox: [153, 86, 188, 175]
Ground truth: [153.77, 86.61, 187.13, 178.3]
Result: Correct
IoU: 0.9143
Generation Time: 9.09s
Predicted bbox: [155, 58, 189, 178]
Ground truth: [153.77, 86.61, 187.13, 178.3]
Result: Correct
IoU: 0.6987
Dataset: refcocos_test
Caption: the woman who is wearing pink clothing and not smiling
Image: val2017/000000084241.jpg
Ground Truth Without CD With CD
Generation Time: 4.59s
Predicted bbox: [200, 20, 293, 292]
Ground truth: [198.89, 23.16, 284.51, 289.42]
Result: Correct
IoU: 0.8793
Generation Time: 8.92s
Predicted bbox: [194, 21, 287, 296]
Ground truth: [198.89, 23.16, 284.51, 289.42]
Result: Correct
IoU: 0.8914
Dataset: refcocos_test
Caption: the bottle behind the stove with yellow and red wrapping
Image: val2017/000000074209.jpg
Ground Truth Without CD With CD
Generation Time: 4.66s
Predicted bbox: [163, 191, 178, 226]
Ground truth: [171.43, 197.48, 181.38, 223.79]
Result: Incorrect
IoU: 0.2816
Generation Time: 8.39s
Predicted bbox: [133, 194, 146, 233]
Ground truth: [171.43, 197.48, 181.38, 223.79]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the spoon inside the glass cup filled with water
Image: val2017/000000239627.jpg
Ground Truth Without CD With CD
Generation Time: 4.18s
Predicted bbox: [459, 284, 541, 357]
Ground truth: [425.77, 173.22, 501.2, 248.64]
Result: Incorrect
IoU: 0.0000
Generation Time: 14.87s
Predicted bbox: [458, 285, 541, 357]
Ground truth: [425.77, 173.22, 501.2, 248.64]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: a hand soap on the bathroom counter next to a pile of paper towels
Image: val2017/000000195165.jpg
Ground Truth Without CD With CD
Generation Time: 5.62s
Predicted bbox: [329, 265, 350, 319]
Ground truth: [329.8, 263.35, 347.95, 312.64000000000004]
Result: Correct
IoU: 0.7429
Generation Time: 10.54s
Predicted bbox: [329, 265, 351, 321]
Ground truth: [329.8, 263.35, 347.95, 312.64000000000004]
Result: Correct
IoU: 0.6852
Dataset: refcocos_test
Caption: the reflection in the mirror of a cup not containing a toothbrush
Image: val2017/000000492878.jpg
Ground Truth Without CD With CD
Generation Time: 5.57s
Predicted bbox: [49, 81, 187, 203]
Ground truth: [53.06, 77.26, 182.38, 282.86]
Result: Correct
IoU: 0.5707
Generation Time: 12.56s
Predicted bbox: [47, 83, 187, 226]
Ground truth: [53.06, 77.26, 182.38, 282.86]
Result: Correct
IoU: 0.6577
Dataset: refcocos_test
Caption: the metal pot on the left stove
Image: val2017/000000175364.jpg
Ground Truth Without CD With CD
Generation Time: 4.32s
Predicted bbox: [137, 248, 261, 417]
Ground truth: None
Result: Unable to determine
Generation Time: 13.23s
Predicted bbox: [160, 252, 260, 345]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: the person who is neither facing the camera nor wearing a brown jacket
Image: val2017/000000438774.jpg
Ground Truth Without CD With CD
Generation Time: 4.22s
Predicted bbox: [343, 49, 473, 375]
Ground truth: [333.68, 51.59, 458.49, 382.25]
Result: Correct
IoU: 0.8067
Generation Time: 10.73s
Predicted bbox: [349, 47, 470, 377]
Ground truth: [333.68, 51.59, 458.49, 382.25]
Result: Correct
IoU: 0.7818
Dataset: refcocos_test
Caption: a plastic bottle without a label
Image: val2017/000000485424.jpg
Ground Truth Without CD With CD
Generation Time: 4.54s
Predicted bbox: [522, 169, 550, 244]
Ground truth: [50.52, 237.88, 113.97, 315.43]
Result: Incorrect
IoU: 0.0000
Generation Time: 9.17s
Predicted bbox: [522, 170, 552, 249]
Ground truth: [50.52, 237.88, 113.97, 315.43]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: a red bowl that is not on the counter nor the stove
Image: val2017/000000530836.jpg
Ground Truth Without CD With CD
Generation Time: 5.29s
Predicted bbox: [0, 196, 31, 215]
Ground truth: [0, 190.59, 30.24, 209.08]
Result: Correct
IoU: 0.5256
Generation Time: 10.97s
Predicted bbox: [0, 196, 30, 213]
Ground truth: [0, 190.59, 30.24, 209.08]
Result: Correct
IoU: 0.5798
Dataset: refcocos_test
Caption: a woman wearing sandals
Image: val2017/000000177934.jpg
Ground Truth Without CD With CD
Generation Time: 4.36s
Predicted bbox: [350, 151, 415, 336]
Ground truth: [352.76, 154.48, 405.12, 338.5]
Result: Correct
IoU: 0.7819
Generation Time: 9.44s
Predicted bbox: [349, 154, 409, 342]
Ground truth: [352.76, 154.48, 405.12, 338.5]
Result: Correct
IoU: 0.8542
Dataset: refcocos_test
Caption: the bottle with a black cap, second from the left
Image: val2017/000000040471.jpg
Ground Truth Without CD With CD
Generation Time: 7.13s
Predicted bbox: [116, 310, 135, 342]
Ground truth: [309.15, 323.63, 317.90999999999997, 339.51]
Result: Incorrect
IoU: 0.0000
Generation Time: 9.58s
Predicted bbox: [112, 311, 133, 346]
Ground truth: [309.15, 323.63, 317.90999999999997, 339.51]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: a bowl on a metal wall-mounted open cabinet that is not stacked
Image: val2017/000000455597.jpg
Ground Truth Without CD With CD
Generation Time: 4.55s
Predicted bbox: [177, 150, 236, 177]
Ground truth: [182.03, 164.87, 209.75, 174.99]
Result: Incorrect
IoU: 0.1761
Generation Time: 15.03s
Predicted bbox: [178, 151, 236, 175]
Ground truth: [182.03, 164.87, 209.75, 174.99]
Result: Incorrect
IoU: 0.2015
Dataset: refcocos_test
Caption: the second bottle from the right on the kitchen countertop
Image: val2017/000000308799.jpg
Ground Truth Without CD With CD
Generation Time: 3.80s
Predicted bbox: [231, 226, 242, 262]
Ground truth: [243.36, 224.27, 253.26000000000002, 254.91000000000003]
Result: Incorrect
IoU: 0.0000
Generation Time: 11.39s
Predicted bbox: [235, 223, 244, 260]
Ground truth: [243.36, 224.27, 253.26000000000002, 254.91000000000003]
Result: Incorrect
IoU: 0.0318
Dataset: refcocos_test
Caption: the plant that is not on the windowsill and is located on the right side
Image: val2017/000000045229.jpg
Ground Truth Without CD With CD
Generation Time: 3.92s
Predicted bbox: [318, 383, 387, 427]
Ground truth: [348.72, 388.45, 380.18, 423.94]
Result: Incorrect
IoU: 0.3678
Generation Time: 10.19s
Predicted bbox: [316, 379, 387, 425]
Ground truth: [348.72, 388.45, 380.18, 423.94]
Result: Incorrect
IoU: 0.3419
Dataset: refcocos_test
Caption: a pot on the stovetop next to the coffee machine
Image: val2017/000000109976.jpg
Ground Truth Without CD With CD
Generation Time: 5.11s
Predicted bbox: [126, 164, 383, 476]
Ground truth: None
Result: Unable to determine
Generation Time: 10.34s
Predicted bbox: [359, 191, 383, 253]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: a black chair that doesn't have any object on it and has a backrest
Image: val2017/000000441247.jpg
Ground Truth Without CD With CD
Generation Time: 5.65s
Predicted bbox: [223, 221, 306, 345]
Ground truth: [221.7, 220.73, 301.35, 347.73]
Result: Correct
IoU: 0.9086
Generation Time: 11.10s
Predicted bbox: [222, 221, 301, 343]
Ground truth: [221.7, 220.73, 301.35, 347.73]
Result: Correct
IoU: 0.9528
Dataset: refcocos_test
Caption: a stack of two books that is not placed next to the chair
Image: val2017/000000093437.jpg
Ground Truth Without CD With CD
Generation Time: 4.11s
Predicted bbox: [430, 282, 491, 314]
Ground truth: [490.55, 212.2, 523.83, 222.91]
Result: Incorrect
IoU: 0.0000
Generation Time: 10.32s
Predicted bbox: [431, 283, 493, 314]
Ground truth: [490.55, 212.2, 523.83, 222.91]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: a chair that is neither close to the wall nor closest to the camera
Image: val2017/000000221708.jpg
Ground Truth Without CD With CD
Generation Time: 4.66s
Predicted bbox: [270, 260, 374, 478]
Ground truth: [210.25, 255.88, 285.67, 424.44]
Result: Incorrect
IoU: 0.0785
Generation Time: 10.46s
Predicted bbox: [270, 262, 376, 476]
Ground truth: [210.25, 255.88, 285.67, 424.44]
Result: Incorrect
IoU: 0.0775
Dataset: refcocos_test
Caption: the chair farthest from the microwave
Image: val2017/000000216497.jpg
Ground Truth Without CD With CD
Generation Time: 4.09s
Predicted bbox: [301, 242, 349, 377]
Ground truth: [386.03, 269.83, 468.07, 451.40999999999997]
Result: Incorrect
IoU: 0.0000
Generation Time: 10.45s
Predicted bbox: [392, 270, 466, 449]
Ground truth: [386.03, 269.83, 468.07, 451.40999999999997]
Result: Correct
IoU: 0.8892
Dataset: refcocos_test
Caption: the upper oven embedded in the white cabinet
Image: val2017/000000458768.jpg
Ground Truth Without CD With CD
Generation Time: 5.14s
Predicted bbox: [412, 240, 431, 288]
Ground truth: [408.36, 206.98, 428.23, 252.88]
Result: Incorrect
IoU: 0.1294
Generation Time: 14.30s
Predicted bbox: [411, 227, 433, 290]
Ground truth: [408.36, 206.98, 428.23, 252.88]
Result: Incorrect
IoU: 0.2408
Dataset: refcocos_test
Caption: the middle chair among chairs with green mat
Image: val2017/000000543047.jpg
Ground Truth Without CD With CD
Generation Time: 4.30s
Predicted bbox: [361, 214, 419, 287]
Ground truth: [397.23, 177.89, 419.54, 206.7]
Result: Incorrect
IoU: 0.0000
Generation Time: 10.25s
Predicted bbox: [355, 210, 423, 285]
Ground truth: [397.23, 177.89, 419.54, 206.7]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the middle chair at the dining table not on the sofa side
Image: val2017/000000472046.jpg
Ground Truth Without CD With CD
Generation Time: 5.72s
Predicted bbox: [184, 244, 226, 311]
Ground truth: [70.16, 256.44, 100.33, 277.29]
Result: Incorrect
IoU: 0.0000
Generation Time: 11.72s
Predicted bbox: [156, 242, 228, 314]
Ground truth: [70.16, 256.44, 100.33, 277.29]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: a person holding a green umbrella walking away from the camera
Image: val2017/000000045596.jpg
Ground Truth Without CD With CD
Generation Time: 5.15s
Predicted bbox: [122, 340, 142, 375]
Ground truth: None
Result: Unable to determine
Generation Time: 19.45s
Predicted bbox: [122, 343, 140, 376]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: a bicycle without a basket that is partially blocked by a yellow pole
Image: val2017/000000259830.jpg
Ground Truth Without CD With CD
Generation Time: 4.33s
Predicted bbox: [305, 426, 417, 581]
Ground truth: [338.58, 426.67, 425, 579.44]
Result: Correct
IoU: 0.6447
Generation Time: 9.44s
Predicted bbox: [267, 412, 420, 581]
Ground truth: [338.58, 426.67, 425, 579.44]
Result: Incorrect
IoU: 0.4672
Dataset: refcocos_test
Caption: a black car following a silver car, moving away from the camera
Image: val2017/000000309391.jpg
Ground Truth Without CD With CD
Generation Time: 5.19s
Predicted bbox: [259, 237, 328, 296]
Ground truth: [450.31, 125.53, 475.14, 145.32]
Result: Incorrect
IoU: 0.0000
Generation Time: 13.59s
Predicted bbox: [260, 234, 328, 296]
Ground truth: [450.31, 125.53, 475.14, 145.32]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the car next to the car with its door open
Image: val2017/000000357737.jpg
Ground Truth Without CD With CD
Generation Time: 4.27s
Predicted bbox: [33, 84, 265, 152]
Ground truth: [576.36, 88.04, 583.4300000000001, 91.39]
Result: Incorrect
IoU: 0.0000
Generation Time: 10.09s
Predicted bbox: [66, 111, 380, 291]
Ground truth: [576.36, 88.04, 583.4300000000001, 91.39]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: a bicycle that is not placed on the ground
Image: val2017/000000055022.jpg
Ground Truth Without CD With CD
Generation Time: 3.84s
Predicted bbox: [234, 0, 280, 81]
Ground truth: [245.25, 0.76, 281.89, 80.16000000000001]
Result: Correct
IoU: 0.7118
Generation Time: 9.06s
Predicted bbox: [21, 188, 336, 637]
Ground truth: [245.25, 0.76, 281.89, 80.16000000000001]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: a person who is neither standing, sitting, walking, nor skateboarding
Image: val2017/000000087038.jpg
Ground Truth Without CD With CD
Generation Time: 5.33s
Predicted bbox: [202, 227, 218, 264]
Ground truth: [257.85, 224.48, 301.98, 321.48]
Result: Incorrect
IoU: 0.0000
Generation Time: 11.24s
Predicted bbox: [630, 221, 644, 265]
Ground truth: [257.85, 224.48, 301.98, 321.48]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: a person who is not on the sidewalk and is carrying a bag that is not blue
Image: val2017/000000577932.jpg
Ground Truth Without CD With CD
Generation Time: 5.32s
Predicted bbox: [221, 230, 322, 475]
Ground truth: [231.84, 233.37, 314.82, 492.06]
Result: Correct
IoU: 0.7664
Generation Time: 14.19s
Predicted bbox: [215, 229, 318, 478]
Ground truth: [231.84, 233.37, 314.82, 492.06]
Result: Correct
IoU: 0.7571
Dataset: refcocos_test
Caption: a motorcycle facing toward the road with a red seat
Image: val2017/000000356387.jpg
Ground Truth Without CD With CD
Generation Time: 4.11s
Predicted bbox: [144, 217, 322, 336]
Ground truth: [216.26, 220.98, 312.12, 308.39]
Result: Incorrect
IoU: 0.3956
Generation Time: 9.23s
Predicted bbox: [141, 220, 325, 328]
Ground truth: [216.26, 220.98, 312.12, 308.39]
Result: Incorrect
IoU: 0.4217
Dataset: refcocos_test
Caption: a car partially blocked by a black Audi sedan
Image: val2017/000000122166.jpg
Ground Truth Without CD With CD
Generation Time: 4.44s
Predicted bbox: [318, 295, 446, 397]
Ground truth: [367.19, 280.64, 453.58, 358.58]
Result: Incorrect
IoU: 0.3391
Generation Time: 12.31s
Predicted bbox: [319, 297, 450, 399]
Ground truth: [367.19, 280.64, 453.58, 358.58]
Result: Incorrect
IoU: 0.3401
Dataset: refcocos_test
Caption: a vehicle that has a backrest and does not have four wheels
Image: val2017/000000441586.jpg
Ground Truth Without CD With CD
Generation Time: 4.79s
Predicted bbox: [274, 176, 491, 402]
Ground truth: [395.66, 146.02, 426.92, 224.5]
Result: Incorrect
IoU: 0.0303
Generation Time: 10.78s
Predicted bbox: [277, 156, 494, 404]
Ground truth: [395.66, 146.02, 426.92, 224.5]
Result: Incorrect
IoU: 0.0396
Dataset: refcocos_test
Caption: a boat with a red top and a white hull
Image: val2017/000000228436.jpg
Ground Truth Without CD With CD
Generation Time: 4.33s
Predicted bbox: [102, 180, 231, 272]
Ground truth: [287.59, 138.45, 356.13, 167.37]
Result: Incorrect
IoU: 0.0000
Generation Time: 8.95s
Predicted bbox: [521, 143, 644, 201]
Ground truth: [287.59, 138.45, 356.13, 167.37]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: a person next to a bicycle lying on the ground who is not making a phone call
Image: val2017/000000414510.jpg
Ground Truth Without CD With CD
Generation Time: 4.75s
Predicted bbox: [104, 248, 257, 408]
Ground truth: [54.6, 267.4, 134.32, 388.39]
Result: Incorrect
IoU: 0.1204
Generation Time: 15.50s
Predicted bbox: [57, 250, 254, 407]
Ground truth: [54.6, 267.4, 134.32, 388.39]
Result: Incorrect
IoU: 0.2997
Dataset: refcocos_test
Caption: a bus moving toward the camera with blue on its front
Image: val2017/000000210273.jpg
Ground Truth Without CD With CD
Generation Time: 4.33s
Predicted bbox: [369, 116, 431, 194]
Ground truth: [362.04, 118.05, 424.51, 194.73000000000002]
Result: Correct
IoU: 0.7793
Generation Time: 14.50s
Predicted bbox: [374, 117, 426, 190]
Ground truth: [362.04, 118.05, 424.51, 194.73000000000002]
Result: Correct
IoU: 0.7339
Dataset: refcocos_test
Caption: a person standing at the doorway and eating something
Image: val2017/000000507037.jpg
Ground Truth Without CD With CD
Generation Time: 4.41s
Predicted bbox: [66, 225, 137, 424]
Ground truth: [0, 246.23, 25.34, 383.58]
Result: Incorrect
IoU: 0.0000
Generation Time: 13.01s
Predicted bbox: [69, 226, 140, 422]
Ground truth: [0, 246.23, 25.34, 383.58]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the man wearing backpack to the left side of the red-shirt girl, not the boy
Image: val2017/000000350122.jpg
Ground Truth Without CD With CD
Generation Time: 4.87s
Predicted bbox: [94, 204, 149, 375]
Ground truth: [96.9, 206.17, 151.79000000000002, 380.67999999999995]
Result: Correct
IoU: 0.8634
Generation Time: 15.99s
Predicted bbox: [122, 205, 198, 376]
Ground truth: [96.9, 206.17, 151.79000000000002, 380.67999999999995]
Result: Incorrect
IoU: 0.2888
Dataset: refcocos_test
Caption: the one standing above stairs, not near bike or motor
Image: val2017/000000038829.jpg
Ground Truth Without CD With CD
Generation Time: 5.85s
Predicted bbox: [406, 126, 455, 214]
Ground truth: [392.03, 60.98, 418.64, 133.46]
Result: Incorrect
IoU: 0.0153
Generation Time: 13.31s
Predicted bbox: [415, 126, 457, 210]
Ground truth: [392.03, 60.98, 418.64, 133.46]
Result: Incorrect
IoU: 0.0050
Dataset: refcocos_test
Caption: the black motor on the front, close to the red motor, not close to the green bike
Image: val2017/000000291634.jpg
Ground Truth Without CD With CD
Generation Time: 7.06s
Predicted bbox: [56, 225, 399, 447]
Ground truth: [17, 224.07, 168.41, 384.39]
Result: Incorrect
IoU: 0.2172
Generation Time: 13.17s
Predicted bbox: [39, 237, 398, 455]
Ground truth: [17, 224.07, 168.41, 384.39]
Result: Incorrect
IoU: 0.2285
Dataset: refcocos_test
Caption: the third one from the back of the boat
Image: val2017/000000395180.jpg
Ground Truth Without CD With CD
Generation Time: 4.69s
Predicted bbox: [466, 184, 481, 203]
Ground truth: [477.48, 189.92, 493.46000000000004, 203.54999999999998]
Result: Incorrect
IoU: 0.1008
Generation Time: 10.68s
Predicted bbox: [484, 187, 499, 202]
Ground truth: [477.48, 189.92, 493.46000000000004, 203.54999999999998]
Result: Incorrect
IoU: 0.3478
Dataset: refcocos_test
Caption: the motor close to the woman's red motor
Image: val2017/000000226417.jpg
Ground Truth Without CD With CD
Generation Time: 6.17s
Predicted bbox: [200, 196, 231, 263]
Ground truth: [194.7, 197.58, 231.32999999999998, 262.70000000000005]
Result: Correct
IoU: 0.8261
Generation Time: 11.75s
Predicted bbox: [195, 193, 233, 266]
Ground truth: [194.7, 197.58, 231.32999999999998, 262.70000000000005]
Result: Correct
IoU: 0.8469
Dataset: refcocos_test
Caption: The light near the '24 hour' sign with only three signals
Image: val2017/000000301376.jpg
Ground Truth Without CD With CD
Generation Time: 6.26s
Predicted bbox: [265, 212, 287, 307]
Ground truth: [83.29, 94.92, 111.14000000000001, 152.57]
Result: Incorrect
IoU: 0.0000
Generation Time: 12.11s
Predicted bbox: [63, 3, 98, 97]
Ground truth: [83.29, 94.92, 111.14000000000001, 152.57]
Result: Incorrect
IoU: 0.0063
Dataset: refcocos_test
Caption: the guy on the right side, red shirt, with no hat
Image: val2017/000000455624.jpg
Ground Truth Without CD With CD
Generation Time: 4.47s
Predicted bbox: [418, 137, 450, 184]
Ground truth: [614.54, 143.23, 640, 206.35999999999999]
Result: Incorrect
IoU: 0.0000
Generation Time: 10.72s
Predicted bbox: [423, 137, 454, 187]
Ground truth: [614.54, 143.23, 640, 206.35999999999999]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: red cat on the blue motor
Image: val2017/000000139099.jpg
Ground Truth Without CD With CD
Generation Time: 4.78s
Predicted bbox: [220, 280, 644, 448]
Ground truth: None
Result: Unable to determine
Generation Time: 7.79s
Predicted bbox: [215, 284, 644, 448]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: the girl who is not a reflection in the glass
Image: val2017/000000292456.jpg
Ground Truth Without CD With CD
Generation Time: 4.59s
Predicted bbox: [326, 65, 447, 339]
Ground truth: None
Result: Unable to determine
Generation Time: 10.12s
Predicted bbox: [315, 63, 446, 338]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: the guy closest to the motor with a black box in the back
Image: val2017/000000534605.jpg
Ground Truth Without CD With CD
Generation Time: 5.72s
Predicted bbox: [348, 98, 402, 163]
Ground truth: [283.68, 97.14, 323.75, 216.22]
Result: Incorrect
IoU: 0.0000
Generation Time: 9.90s
Predicted bbox: [342, 99, 403, 169]
Ground truth: [283.68, 97.14, 323.75, 216.22]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the light with green signal on
Image: val2017/000000178982.jpg
Ground Truth Without CD With CD
Generation Time: 3.76s
Predicted bbox: [330, 28, 532, 90]
Ground truth: None
Result: Unable to determine
Generation Time: 11.15s
Predicted bbox: [327, 29, 519, 93]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: the motorcycle being leaned on by the person in the striped shirt
Image: val2017/000000574702.jpg
Ground Truth Without CD With CD
Generation Time: 7.77s
Predicted bbox: [172, 284, 329, 504]
Ground truth: [85.46, 215.72, 119.69, 302.77]
Result: Incorrect
IoU: 0.0000
Generation Time: 11.91s
Predicted bbox: [198, 292, 332, 504]
Ground truth: [85.46, 215.72, 119.69, 302.77]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the guy on the wider motor, and his clothes is orange
Image: val2017/000000246963.jpg
Ground Truth Without CD With CD
Generation Time: 5.36s
Predicted bbox: [147, 118, 245, 242]
Ground truth: [167.62, 151.24, 233.54000000000002, 242.58]
Result: Incorrect
IoU: 0.4908
Generation Time: 11.09s
Predicted bbox: [219, 107, 280, 219]
Ground truth: [167.62, 151.24, 233.54000000000002, 242.58]
Result: Incorrect
IoU: 0.0830
Dataset: refcocos_test
Caption: the kite that flies the third highest in the middle
Image: val2017/000000345027.jpg
Ground Truth Without CD With CD
Generation Time: 4.16s
Predicted bbox: [311, 178, 330, 185]
Ground truth: [310.6, 183.89, 326.46000000000004, 189.20999999999998]
Result: Incorrect
IoU: 0.0857
Generation Time: 12.66s
Predicted bbox: [312, 178, 332, 186]
Ground truth: [310.6, 183.89, 326.46000000000004, 189.20999999999998]
Result: Incorrect
IoU: 0.1427
Dataset: refcocos_test
Caption: the book with cover of same color as the page
Image: val2017/000000200839.jpg
Ground Truth Without CD With CD
Generation Time: 5.01s
Predicted bbox: [358, 196, 412, 261]
Ground truth: [143.97, 240.83, 176.81, 268.65000000000003]
Result: Incorrect
IoU: 0.0000
Generation Time: 11.18s
Predicted bbox: [352, 192, 408, 259]
Ground truth: [143.97, 240.83, 176.81, 268.65000000000003]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the person on the right side of the row with no one wearing hat but one wearing hairband
Image: val2017/000000057672.jpg
Ground Truth Without CD With CD
Generation Time: 6.87s
Predicted bbox: [553, 196, 596, 267]
Ground truth: [417.03, 191.5, 484.16999999999996, 289.63]
Result: Incorrect
IoU: 0.0000
Generation Time: 12.05s
Predicted bbox: [556, 197, 609, 264]
Ground truth: [417.03, 191.5, 484.16999999999996, 289.63]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the guy with brown jeans, black hoodie and a suitcase walking in the street
Image: val2017/000000138639.jpg
Ground Truth Without CD With CD
Generation Time: 4.87s
Predicted bbox: [384, 318, 451, 460]
Ground truth: None
Result: Unable to determine
Generation Time: 14.06s
Predicted bbox: [387, 321, 450, 457]
Ground truth: None
Result: Unable to determine
Dataset: refcocos_test
Caption: the man between the yellow-coat guy and the blue-shirt woman, hair not yellow
Image: val2017/000000078748.jpg
Ground Truth Without CD With CD
Generation Time: 5.96s
Predicted bbox: [252, 17, 303, 110]
Ground truth: [244.74, 13.38, 289.17, 111.38]
Result: Correct
IoU: 0.6129
Generation Time: 14.18s
Predicted bbox: [248, 17, 301, 109]
Ground truth: [244.74, 13.38, 289.17, 111.38]
Result: Correct
IoU: 0.6959
Dataset: refcocos_test
Caption: the white car next to the red car, in front of the car that's tilted to the side
Image: val2017/000000111086.jpg
Ground Truth Without CD With CD
Generation Time: 5.62s
Predicted bbox: [163, 227, 296, 284]
Ground truth: [126.86, 201, 156.21, 232.27]
Result: Incorrect
IoU: 0.0000
Generation Time: 13.95s
Predicted bbox: [166, 227, 295, 285]
Ground truth: [126.86, 201, 156.21, 232.27]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the man with glasses, not in the car nor driving the motor
Image: val2017/000000461751.jpg
Ground Truth Without CD With CD
Generation Time: 5.58s
Predicted bbox: [501, 159, 644, 438]
Ground truth: [464.34, 120.29, 640, 569]
Result: Incorrect
IoU: 0.4851
Generation Time: 9.75s
Predicted bbox: [498, 130, 644, 446]
Ground truth: [464.34, 120.29, 640, 569]
Result: Correct
IoU: 0.5603
Dataset: refcocos_test
Caption: yellow taxi, not on the slope, close to the big bus
Image: val2017/000000336232.jpg
Ground Truth Without CD With CD
Generation Time: 4.23s
Predicted bbox: [413, 214, 644, 415]
Ground truth: [412.46, 107.39, 471.75, 156.45]
Result: Incorrect
IoU: 0.0000
Generation Time: 10.75s
Predicted bbox: [413, 202, 644, 415]
Ground truth: [412.46, 107.39, 471.75, 156.45]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the man with short hair, sitting behind the long-hair guy
Image: val2017/000000248334.jpg
Ground Truth Without CD With CD
Generation Time: 6.03s
Predicted bbox: [427, 177, 456, 216]
Ground truth: [469.42, 186.72, 487.43, 221.01999999999998]
Result: Incorrect
IoU: 0.0000
Generation Time: 11.90s
Predicted bbox: [334, 173, 395, 213]
Ground truth: [469.42, 186.72, 487.43, 221.01999999999998]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the one not inside the bus, but close to the door
Image: val2017/000000445834.jpg
Ground Truth Without CD With CD
Generation Time: 4.29s
Predicted bbox: [428, 179, 519, 407]
Ground truth: [424.1, 180.42, 517.0500000000001, 415.7]
Result: Correct
IoU: 0.8991
Generation Time: 10.38s
Predicted bbox: [427, 180, 516, 407]
Ground truth: [424.1, 180.42, 517.0500000000001, 415.7]
Result: Correct
IoU: 0.9205
Dataset: refcocos_test
Caption: the one with dark color cloth, not riding bike
Image: val2017/000000472623.jpg
Ground Truth Without CD With CD
Generation Time: 4.65s
Predicted bbox: [123, 243, 152, 305]
Ground truth: [125.81, 239.86, 152.76, 305.90000000000003]
Result: Correct
IoU: 0.8310
Generation Time: 11.52s
Predicted bbox: [123, 241, 153, 304]
Ground truth: [125.81, 239.86, 152.76, 305.90000000000003]
Result: Correct
IoU: 0.8610
Dataset: refcocos_test
Caption: the red-shirt person, with no hat on
Image: val2017/000000244833.jpg
Ground Truth Without CD With CD
Generation Time: 4.33s
Predicted bbox: [187, 9, 275, 70]
Ground truth: [547.17, 12.75, 613.73, 119.36]
Result: Incorrect
IoU: 0.0000
Generation Time: 10.51s
Predicted bbox: [193, 10, 272, 68]
Ground truth: [547.17, 12.75, 613.73, 119.36]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the bag carried by the person with a purple umbrella and pulling a small suitcase
Image: val2017/000000191845.jpg
Ground Truth Without CD With CD
Generation Time: 4.81s
Predicted bbox: [411, 244, 471, 287]
Ground truth: [436.05, 217.51, 466.85, 254.57999999999998]
Result: Incorrect
IoU: 0.0960
Generation Time: 14.38s
Predicted bbox: [456, 306, 543, 364]
Ground truth: [436.05, 217.51, 466.85, 254.57999999999998]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the guy on the second row, not near the ones with umbrella
Image: val2017/000000267537.jpg
Ground Truth Without CD With CD
Generation Time: 4.44s
Predicted bbox: [74, 0, 186, 172]
Ground truth: [498.75, 0.28, 635.88, 234.72]
Result: Incorrect
IoU: 0.0000
Generation Time: 10.08s
Predicted bbox: [73, 0, 187, 173]
Ground truth: [498.75, 0.28, 635.88, 234.72]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the girl next to the man wearing backpack pointing somewhere, on the stairs
Image: val2017/000000370486.jpg
Ground Truth Without CD With CD
Generation Time: 4.82s
Predicted bbox: [34, 147, 101, 473]
Ground truth: [41.32, 53.59, 76.08, 113.12]
Result: Incorrect
IoU: 0.0000
Generation Time: 8.38s
Predicted bbox: [37, 147, 94, 482]
Ground truth: [41.32, 53.59, 76.08, 113.12]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the chair only able to be seen through the middle window, on the left
Image: val2017/000000333745.jpg
Ground Truth Without CD With CD
Generation Time: 4.65s
Predicted bbox: [138, 35, 250, 117]
Ground truth: [150.39, 33.71, 206.79, 95.23]
Result: Incorrect
IoU: 0.3670
Generation Time: 10.24s
Predicted bbox: [141, 33, 250, 115]
Ground truth: [150.39, 33.71, 206.79, 95.23]
Result: Incorrect
IoU: 0.3882
Dataset: refcocos_test
Caption: the person on the right side from the single man's perspective, not holding beer
Image: val2017/000000074058.jpg
Ground Truth Without CD With CD
Generation Time: 5.40s
Predicted bbox: [125, 167, 220, 294]
Ground truth: [27.67, 158.53, 102.07000000000001, 334.65]
Result: Incorrect
IoU: 0.0000
Generation Time: 14.73s
Predicted bbox: [127, 167, 217, 294]
Ground truth: [27.67, 158.53, 102.07000000000001, 334.65]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the bed no one sitting on, and not the upper bed
Image: val2017/000000393569.jpg
Ground Truth Without CD With CD
Generation Time: 3.84s
Predicted bbox: [353, 222, 589, 453]
Ground truth: [452.97, 169.19, 592.4300000000001, 322.7]
Result: Incorrect
IoU: 0.2201
Generation Time: 10.04s
Predicted bbox: [365, 294, 585, 452]
Ground truth: [452.97, 169.19, 592.4300000000001, 322.7]
Result: Incorrect
IoU: 0.0723
Dataset: refcocos_test
Caption: the cup facing right, not the rightmost one
Image: val2017/000000292060.jpg
Ground Truth Without CD With CD
Generation Time: 5.74s
Predicted bbox: [265, 274, 293, 304]
Ground truth: [271.28, 271.82, 296.09, 298.78999999999996]
Result: Correct
IoU: 0.5547
Generation Time: 11.17s
Predicted bbox: [267, 273, 290, 302]
Ground truth: [271.28, 271.82, 296.09, 298.78999999999996]
Result: Correct
IoU: 0.5658
Dataset: refcocos_test
Caption: the white cup, beside the sink
Image: val2017/000000437898.jpg
Ground Truth Without CD With CD
Generation Time: 3.87s
Predicted bbox: [68, 224, 83, 243]
Ground truth: [65.61, 228.82, 82.41, 249.57999999999998]
Result: Incorrect
IoU: 0.4758
Generation Time: 9.16s
Predicted bbox: [67, 224, 85, 242]
Ground truth: [65.61, 228.82, 82.41, 249.57999999999998]
Result: Incorrect
IoU: 0.4324
Dataset: refcocos_test
Caption: the green cup on top of an array of glasses, not upright
Image: val2017/000000052996.jpg
Ground Truth Without CD With CD
Generation Time: 5.61s
Predicted bbox: [119, 247, 173, 286]
Ground truth: [118.01, 269.13, 151.59, 294.07]
Result: Incorrect
IoU: 0.2297
Generation Time: 14.10s
Predicted bbox: [118, 248, 174, 288]
Ground truth: [118.01, 269.13, 151.59, 294.07]
Result: Incorrect
IoU: 0.2593
Dataset: refcocos_test
Caption: the knife beside the pineapple, third from the left
Image: val2017/000000078266.jpg
Ground Truth Without CD With CD
Generation Time: 7.03s
Predicted bbox: [262, 139, 271, 176]
Ground truth: [284.1, 200.63, 290.89000000000004, 209.12]
Result: Incorrect
IoU: 0.0000
Generation Time: 13.93s
Predicted bbox: [254, 142, 260, 178]
Ground truth: [284.1, 200.63, 290.89000000000004, 209.12]
Result: Incorrect
IoU: 0.0000
Dataset: refcocos_test
Caption: the bowl on the same cabinet shelf as the small yellow object
Image: val2017/000000156278.jpg
Ground Truth Without CD With CD
Generation Time: 5.51s
Predicted bbox: [287, 346, 339, 387]
Ground truth: [278.52, 305.12, 336.25, 333.28000000000003]
Result: Incorrect
IoU: 0.0000
Generation Time: 9.47s
Predicted bbox: [283, 346, 337, 389]
Ground truth: [278.52, 305.12, 336.25, 333.28000000000003]
Result: Incorrect
IoU: 0.0000